Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(metadata): enhance retry logic for metadata server access in _metadata.py #1545

Merged
merged 4 commits into from
Jul 12, 2024

Conversation

mrvarmazyar
Copy link
Contributor

@mrvarmazyar mrvarmazyar commented Jul 1, 2024

This PR enhances the retry logic in the _metadata.py file to handle transient network issues more effectively when accessing the GCE metadata server. The existing retry logic in the ping and get functions have been replaced with the ExponentialBackoff class from _exponential_backoff.py, which introduces exponential backoff and jitter.

@mrvarmazyar mrvarmazyar requested review from a team as code owners July 1, 2024 15:06
@mrvarmazyar mrvarmazyar force-pushed the enhanced-retry-logic-metadata branch from 05be25a to c29f0f6 Compare July 1, 2024 15:10
@mrvarmazyar mrvarmazyar changed the title Enhance Retry Logic for Metadata Server Access in _metadata.py fix(metadata) Enhance Retry Logic for Metadata Server Access in _metadata.py Jul 1, 2024
@clundin25
Copy link
Contributor

Hi @mrvarmazyar, thank you kindly for your contribution!

There exists an exponential backoff implementation in this repo. I think the loop should instead be refactored to use it https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/_exponential_backoff.py.

This reduces new code + adds jitter to the retries.

@clundin25 clundin25 self-assigned this Jul 1, 2024
@mrvarmazyar mrvarmazyar force-pushed the enhanced-retry-logic-metadata branch from c29f0f6 to e64f1ac Compare July 1, 2024 17:22
@mrvarmazyar
Copy link
Contributor Author

mrvarmazyar commented Jul 1, 2024

Hi @mrvarmazyar, thank you kindly for your contribution!

There exists an exponential backoff implementation in this repo. I think the loop should instead be refactored to use it https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/_exponential_backoff.py.

This reduces new code + adds jitter to the retries.

@clundin25 Thank you for your feedback and for pointing me towards the ExponentialBackoff implementation. I have updated the _metadata.py file to use the ExponentialBackoff class for retry logic, reducing the new code and adding jitter to the retries as suggested.
I also tested it, and it succeeded.

PS: there was also a miss typo in the contribution document

@clundin25
Copy link
Contributor

I think there are two actions remaining.

  1. Update the unit tests to mock out sleep so they don't take longer than necessar.
  2. There is a subtle bug in this change. ExponentialBackoff will sleep on the first iteration so this change would introduce ~1s of latency before the first request is sent.

I think ExponentialBackoff should be modified to not sleep on the first iteration, to avoid accidentally adding latency to requests.

Let me chat with my team on ExponentialBackoff. After that I will likely submit a patch to modify it's behavior.

@mrvarmazyar
Copy link
Contributor Author

mrvarmazyar commented Jul 1, 2024

I think there are two actions remaining.

  1. Update the unit tests to mock out sleep so they don't take longer than necessary.

Thanks for the suggestion, I'll take a look to learn a bit of its unit tests and then will push the new changes.

  1. There is a subtle bug in this change. ExponentialBackoff will sleep on the first iteration so this change would introduce ~1s of latency before the first request is sent.

I think ExponentialBackoff should be modified to not sleep on the first iteration, to avoid accidentally adding latency to requests.

Let me chat with my team on ExponentialBackoff. After that I will likely submit a patch to modify it's behavior.

I've reviewed the code and understand your point. While you chat with your team, I'll work on proposing the change in the first iteration.

@clundin25
Copy link
Contributor

I pushed two patches here. Going to spend some more time thinking about this tomorrow.

@mrvarmazyar mrvarmazyar changed the title fix(metadata) Enhance Retry Logic for Metadata Server Access in _metadata.py fix(metadata): enhance retry logic for metadata server access in _metadata.py Jul 2, 2024
@mrvarmazyar
Copy link
Contributor Author

I pushed two patches here. Going to spend some more time thinking about this tomorrow.

@clundin25 I have reviewed your code and since you also made adjustments to the tests, should I make any other changes to my code?

@clundin25
Copy link
Contributor

@mrvarmazyar This LGTM. I will update the tests to mock out the calls to sleep.

Thank you for your contribution!

@mrvarmazyar
Copy link
Contributor Author

Hello @clundin25, do we have any ETA for merging this PR? I'd like to use it in our Apache Airflow on GKE clusters, and this is why I'm excited to have it 🤩
If I can help you to change the tests to mock out the calls to sleep, I'd be happy to do that.

@clundin25
Copy link
Contributor

Hello @clundin25, do we have any ETA for merging this PR? I'd like to use it in our Apache Airflow on GKE clusters, and this is why I'm excited to have it 🤩 If I can help you to change the tests to mock out the calls to sleep, I'd be happy to do that.

Apologies for the delay. I think this should be released by the end of next week.

@clundin25 clundin25 added kokoro:force-run Add this label to force Kokoro to re-run the tests. owlbot:run Add this label to trigger the Owlbot post processor. labels Jul 11, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 11, 2024
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 11, 2024
@clundin25 clundin25 merged commit 61c2432 into googleapis:main Jul 12, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants